Projects in Data Science
Activity A3b: Joins

Hoang Anh Thai Vu

Joining Two Data Frames

Additional reading: Wickham and Grolemund on relational data or Chapter 10 of Data Computing by Kaplan

A join is a data verb that combines two tables.

There are several kinds of join.

Establishing a match between cases

A match between a case in the left table and a case in the right table is made based on the values in pairs of corresponding variables.

As an example, we’ll examine the following two tables on grades and courses. The Grades file has one case for each class of each student, and includes variables describing the ID of the student (sid), the ID of the session (section), and the grade received. The Courses table has variables for the ID of the session (section), the department (coded), the level, the semester, the enrollment, and the ID of the instructor (iid). We show a few random rows of each table below.

Table 1: Student grades.

sid sessionID grade
S31680 session3521 A-
S31242 session2933 C+
S32127 session2046 B
S32058 session2521 B+

Table 2: Information about each course section.

sessionID dept level sem enroll iid
session2568 J 100 FA2002 15 inst223
session1940 d 100 FA2000 16 inst409
session3242 m 200 SP2004 30 inst476
session3132 M 100 SP2004 18 inst257

Mutating joins

The first class of joins are mutating joins, which add new variables (columns) to the left data table from matching observations in the right table.1 There is also a right_join() that adds variables in the reverse direction from the left table to the right table, but we do not really need it as we can always switch the roles of the two tables.

The main difference in the three mutating join options in this class is how they answer the following questions:

  1. What happens when a case in the right table has no matches in the left table?
  2. What happens when a case in the left table has no matches in the right table?

Three mutating join functions:

When there are multiple matches in the right table for a particular case in the left table, all three of these mutating join operators produce a separate case in the new table for each of the matches from the right.

One of the most common and useful mutating joins in one that translates levels of a variable to a new scale. For example, below we’ll see a join that translates letter grades (e.g., “B”) into grade points (e.g., 3).

Example 1 (Average class size: varying viewpoints) Determine the average class size from the viewpoint of a student and the viewpoint of the Provost / Admissions Office.

Solution. The Provost counts each section as one class and takes the average of all classes. We have to be a little careful and cannot simply do mean(Courses$enroll), because some sessionIDs appear twice on the course list. Why is that?2 They are courses that are cross-listed in multiple departments! We can still do this from the data we have in the Courses table, but we should aggregate by sessionID first:

CourseSizes<-Courses %>%
  group_by(sessionID) %>% 
  summarise(total_enroll=sum(enroll))
mean(CourseSizes$total_enroll)
## [1] 21.45251

To get the average class size from the student perspective, we can join the enrollment of the section onto each instance of a student section. Here, the left table is Grades, the right table is CourseSizes, we are going to match based on sessionID, and we want to add the variable total_enroll. We’ll use a left_join since we aren’t interested in any sections from the CourseSizes table that do not show up in the Grades table; their enrollments should be 0, and they are not actually seen by any students. Note, e.g., if there were 100 extra sections of zero enrollments on the Courses table, this would change the average from the Provost’s perspective, but not at all from the students’ perspective.

If the by= is omitteed from a join, then R will perform a natural join, which matches the two table by all variables they have in common. In this case, the only variable in common is the sessionID, so we would get the same results by omitting the second argument. In general, this is not reliable unless we check ahead of time which variables the tables have in common. If two variables to match have different names in the two tables, we can write by=c(“name1”=“name2”).

EnrollmentsWithClassSize <- Grades %>% 
  left_join(CourseSizes, 
            by=c("sessionID"="sessionID")) %>%
  select(sid,sessionID,total_enroll)

Table 3: Student enrollments with the section size.

sid sessionID total_enroll
S31680 session3521 12
S31242 session2933 18
S32127 session2046 34
S32058 session2521 15
AveClassEachStudent<-EnrollmentsWithClassSize %>%
  group_by(sid) %>%
  summarise(ave_enroll = mean(total_enroll, na.rm=TRUE))

Table 4: Average class size seen by each individual student.

sid ave_enroll
S31677 23.83333
S31242 26.72727
S32121 23.33333
S32052 23.63636

The na.rm=TRUE here says that if the class size is not available for a given class, we do not count that class towards the student’s average class size. What is another way to capture the same objective? We could have used an inner_join instead of a left_join when we joined the tables to eliminate any entries from the left table that did not have a match in the right table.

Now we can take the average of the AveClassEachStudent table, counting each student once, to find the average class size from the student perspective:

mean(AveClassEachStudent$ave_enroll)
## [1] 24.41885

We see that the average size from the student perspective (24.4) is greater than the average size from the Provost’s perspective (21.5). It is a fun probability exercise to prove that this fact is always true!!

Filtering joins

The second class of joins are filtering joins, which select specific cases from the left table based on whether they match an observation in the right table.

A particularly common employment of these joins is to use a filtered summary as a comparison to select a subset of the original cases, as follows.

Example 2 (semi_join to compare to a filtered summary) Find a subset of the Grades data that only contains data on the four largest sections in the Courses data set.

Solution.

LargeSections<-Courses %>%
  group_by(sessionID) %>%
  summarise(total_enroll=sum(enroll)) %>%
  top_n(total_enroll, n=4)
GradesFromLargeSections <- Grades %>%
  semi_join(LargeSections)

Example 3 (semi_join) Use semi_join() to create a table with a subset of the rows of Grades corresponding to all classes taken in department J.

Solution. There are multiple ways to do this. We could do a left join to the Grades table to add on the dept variable, and then filter by department, then select all variables except the additional dept variable we just added. Here is a more direct way with semi_join that does not involve adding and subtracting the extra variable:

JCourses <- Courses %>%
  filter(dept=="J")
JGrades <- Grades %>% 
  semi_join(JCourses)  

Let’s double check this worked. Here are the first few entries of our new table:

Table 5: Student enrollments in Department J.

sid sessionID grade
S31185 session1791 A-
S31185 session1792 B+
S31185 session1794 B-
S31185 session1795 C+

The first entry is for session1791. Which department is that?

(Courses%>%filter(sessionID=="session1791"))
## # A tibble: 1 x 6
##     sessionID  dept level    sem enroll     iid
##         <chr> <chr> <int>  <chr>  <int>   <chr>
## 1 session1791     J   100 FA1993     22 inst223

But that only checked the first one. What if we want to double check all of the courses included in Table 5? We can add on the department and do a group by to count the number from each department in our table.

JGrades %>%
  left_join(Courses) %>%
  group_by(dept) %>%
  summarise(total=n())
## # A tibble: 1 x 2
##    dept total
##   <chr> <int>
## 1     J   148

More join practice

Exercise 1 (Which to join?) For each of these questions, say what tables you need to join and identify the corresponding variables.

  1. How many student enrollments in each department?

No need to join table. We can just use the Courses table and then group_by %>% summarise

  1. What’s the grade-point average (GPA) for each student?

We merge Grades and Courses by sessionID, so that we could identify the semester for each student. Then we group_by students and semester, and then average the grades.

  1. What fraction of grades are below B+?

We only need to look at Grades table, assuming we are doing this for the whole school across all time. If we would like to do it by terms or departments, then we need to merge Courses with Grades. Then we filter and look at grades.

  1. What’s the grade-point average for each department or instructor?

Again, merge Grades with Courses by sessionID. Then we group_by department or instructor, and take the average of grades.

Bicycle-Use Patterns

In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.

Two data tables are available:

A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops. Figure 1: A typical Capital Bikeshare station. This one is at Florida and California, next to Pleasant Pops.

One of the vans used to redistribute bicycles to different stations. Figure 2: One of the vans used to redistribute bicycles to different stations.

Here is the code to read in the data:3 Important: To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.

data_site <- 
  "https://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds" 
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("https://tiny.cc./dcf/DC-Stations.csv")

The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you can access the full data set of more than 600,000 events by removing -Small from the name of the data_site.

Warm-up: Temporal patterns

It’s natural to expect that bikes are rented more at some times of day, some days of the week, sme months of the year than others. The variable sdate gives the time (including the date) that the rental started.

Exercise 2 (Single variable temporal plots) Make the following plots and interpret them:

  1. A density plot of the events versus sdate. Use ggplot() and geom_density().
  2. A density plot of the events versus time of day. You can use mutate with lubridate::hour(), and lubridate::minute() to extract the hour of the day and minute within the hour from sdate.
  3. A histogram of the events versus day of the week.
  4. Facet your graph from (b) by day of the week. Is there a pattern?

ggplot(Trips,aes(sdate))+
  geom_density()+
  labs(x="Date",title="Rentals by Date",subtitle="Washington D.C. Bike-renting")

Trips <- Trips %>%
  mutate(daytime=(hour(sdate)+minute(sdate)/60))
ggplot(Trips,aes(daytime))+
  labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
  geom_density()

Trips <- Trips %>%
  mutate(weekday=wday(sdate))
c <- ggplot(Trips,aes(x=weekday))+
  geom_density()+
  labs(x="Week Day",title="Rentals by Week Day",subtitle="Washington D.C. Bike-renting")
c

ggplot(Trips,aes(daytime))+
  labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
  geom_density()+
  facet_grid(weekday~.)

People go out earlier during the weekends!!!!

The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). Do you think these two different categories of users show different rental behavior? How might it interact with the patterns you found in Exercise 2?

Exercise 3 (Customer segmentation) Repeat the graphic from Exercise 2 (d) with the following changes:

  1. Set the fill aesthetic for geom_density() to the client variable. You may also want to set the alpha for transparency and color=NA to suppress the outline of the density function.
  2. Now add the argument position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
  3. Rather than faceting on day of the week, create a new faceting variable like this: mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday")). What does the variable wday represent? Try to understand the code.
  4. Is it better to facet on wday and fill with client, or vice versa?
  5. Of all of the graphics you created so far, which is most effective at telling an interesting story?

ggplot(Trips,aes(daytime))+
  labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
  geom_density(aes(fill=client),alpha=0.2)

ggplot(Trips,aes(daytime))+
  labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
  geom_density(aes(fill=client),alpha=0.2,position=position_stack())

It depends. If we would like to observe the distribution of each client group, the first one is better. But the second one tells a good story about how the fraction between the two groups change during the day within the overall distribution.

Trips <- Trips %>%
  mutate(wday = ifelse(lubridate::wday(sdate) %in% c(1,7), "weekend", "weekday"))

The code separate weekend and weekday.

c<-ggplot(Trips,aes(daytime))+
  labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
  geom_density(aes(fill=client),alpha=0.2)+
  facet_grid(wday~.)
c

ggplot(Trips,aes(daytime))+
  labs(x="Hours",title="Rentals by hours",subtitle="Washington D.C. Bike-renting")+
  geom_density(aes(fill=wday),alpha=0.2)+
  facet_grid(client~.)

I prefer the facet by weekday, so that I could compare the two types of clients. But each tells a different story.

I really like the facet by weekday, fill by client. It shows the different in behaviors between client types, weekday and weekend. Basically, casual renters behave like registered renters during weekend.

c

Mutating join practice: Spatial patterns

Exercise 4 (Visualization of bicycle departures by station) Use the latitude and longitude variables in Stations to make a visualization of the total number of departures from each station in the Trips data. To layer your data on top of a Google map, start your plotting code as follows:

totalTrips<-Trips%>%
  group_by(sstation)%>%
  summarise(totalDep=n())
Stations <- Stations %>%
  left_join(totalTrips,by=c("name"="sstation"))
myMap <- get_map(location="Logan Circle",source="google",maptype="roadmap",zoom=13)
ggmap(myMap)+
  geom_point(data=Stations,aes(x=long, y=lat, color=totalDep,size=totalDep),alpha=0.6)+
  scale_colour_gradient(low="#323FFA",high="#D50000")+
  labs(title="Departures from Each Stations",subtitle="Washington D.C. Rental Bikes")
## Warning: Removed 165 rows containing missing values (geom_point).

Exercise 5 Only 14.4% of the trips in our data are carried out by casual users.4 We can compute this statistic via mean(Trips$client=="Casual"). Create a map that shows which area(s) of the city have stations with a much higher percentage of departures by casual users. Interpret your map.

casualTrips<-Trips%>%
  group_by(sstation)%>%
  summarise(casual.fraction=mean(client=="Casual"))
Stations <- Stations%>%
  left_join(casualTrips,by=c("name"="sstation"))
ggmap(myMap)+
  geom_point(data=Stations,aes(x=long, y=lat, color=casual.fraction,size=casual.fraction),alpha=0.6)+
  scale_colour_gradient(low="#323FFA",high="#D50000")+
  labs(title="Casual Client Departures from Each Stations",subtitle="Washington D.C. Rental Bikes",
       color="Fraction of Casual Clients",size="Fraction of Casual Clients")
## Warning: Removed 165 rows containing missing values (geom_point).

We could see that the areas with most percentage of casual clients are near the parks and tourist areas.

Filtering join practice: Spatiotemporal patterns

Exercise 6 (High traffic points) 5 Hint for part(a): as_date(sdate) converts sdate from date-time format to date format.

  1. Make a table with the ten station-date combinations (e.g., 14th & V St., 2014-10-14) with the highest number of departures, sorted from most departures to fewest.
  2. Use a join operation to make a table with only those trips whose departures match those top ten station-date combinations from part (a).
  3. Group the trips you filtered out in part (b) by client type and wday (weekend/weekday), and count the total number of trips in each of the four groups. Interpret your results.
Trips <- Trips %>%
  mutate(sday=date(sdate))
top10Dep <- Trips%>%
  group_by(sstation,sday)%>%
  summarise(totalDep=n())%>%
  arrange(desc(totalDep))%>%
  head(n=10)
knitr::kable(
  top10Dep[,],caption="Top 10 Departures By Day"
)

Table 6: Top 10 Departures By Day

sstation sday totalDep
Columbus Circle / Union Station 2014-11-12 11
Jefferson Dr & 14th St SW 2014-12-27 9
Lincoln Memorial 2014-10-05 9
Lincoln Memorial 2014-10-09 8
17th St & Massachusetts Ave NW 2014-10-06 7
Columbus Circle / Union Station 2014-10-02 7
Georgetown Harbor / 30th St NW 2014-10-25 7
Massachusetts Ave & Dupont Circle NW 2014-10-01 7
New Hampshire Ave & T St NW 2014-10-16 7
14th & V St NW 2014-11-07 6
top10Trips <- Trips %>%
  semi_join(top10Dep,by=c("sstation"="sstation","sday"="sday"))
top10Client <- top10Trips %>%
  group_by(client,wday)%>%
  summarise(totalDep=n())
ggplot(top10Client,aes(x=client,y=totalDep,fill=wday))+
  geom_col(position="stack")+
  labs(title="Spatiotemporal Renting Pattern",subtitle="For Top 10 Departures Location & Day", 
       fill="Day Type",x="Client Type",y="Total Departures")

Among the top 10 combination of day & location departures, most clients are registered and rent bikes on weekday. Casual clients in this group mostly rent bike during weekend.